Subspace Approximation for Approximate Nearest Neighbor Search in NLP

نویسنده

  • Jing Wang
چکیده

Most natural language processing tasks can be formulated as the approximated nearest neighbor search problem, such as word analogy, document similarity, machine translation. Take the questionanswering task as an example, given a question as the query, the goal is to search its nearest neighbor in the training dataset as the answer. However, existing methods for approximate nearest neighbor search problem may not perform well owing to the following practical challenges: 1) there are noise in the data; 2) the large scale dataset yields a huge retrieval space and high search time complexity. In order to solve these problems, we propose a novel approximate nearest neighbor search framework which i) projects the data to a subspace based spectral analysis which eliminates the influence of noise; ii) partitions the training dataset to different groups in order to reduce the search space. Specifically, the retrieval space is reduced from O(n) to O(log n) (where n is the number of data points in the training dataset). We prove that the retrieved nearest neighbor in the projected subspace is the same as the one in the original feature space. We demonstrate the outstanding performance of our framework on real-world natural language processing tasks.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Fast Large-Scale Approximate Graph Construction for NLP

Many natural language processing problems involve constructing large nearest-neighbor graphs. We propose a system called FLAG to construct such graphs approximately from large data sets. To handle the large amount of data, our algorithm maintains approximate counts based on sketching algorithms. To find the approximate nearest neighbors, our algorithm pairs a new distributed online-PMI algorith...

متن کامل

IRWIN AND JOAN JACOBS CENTER FOR COMMUNICATION AND INFORMATION TECHNOLOGIES A General Framework for Approximate Nearest Subspace Search

Subspaces offer convenient means of representing information in many Pattern Recognition, Machine Vision, and Statistical Learning applications. Contrary to the growing popularity of subspace representations, the problem of efficiently searching through large subspace databases has received little attention in the past. In this paper we present a general solution to the Approximate Nearest Subs...

متن کامل

FLAG: Fast Large-Scale Graph Construction for NLP

Many natural language processing (NLP) problems involve constructing large nearest-neighbor graphs between word pairs by computing distributional similarity between word pairs from large corpora. In this paper, first we describe a system called FLAG to construct such graphs approximately from large data sets. To handle the large amount of data in memory and time efficient manner, FLAG maintains...

متن کامل

Approximate nearest neighbor search for $\ell_p$-spaces ($2<p<\infty$) via embeddings

While the problem of approximate nearest neighbor search has been well-studied for Eu-clidean space and ℓ 1 , few non-trivial algorithms are known for ℓ p when 2 < p < ∞. In this paper, we revisit this fundamental problem and present approximate nearest-neighbor search algorithms which give the first non-trivial approximation factor guarantees in this setting.

متن کامل

Nearest Neighbor Search using Kd-trees

We suggest a simple modification to the kd-tree search algorithm for nearest neighbor search resulting in an improved performance. The Kd-tree data structure seems to work well in finding nearest neighbors in low dimensions but its performance degrades even if the number of dimensions increases to more than three. Since the exact nearest neighbor search problem suffers from the curse of dimensi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1708.07775  شماره 

صفحات  -

تاریخ انتشار 2017